Stemming Indonesian

Authors

  • Jelita Asian
  • Hugh E. Williams
  • Seyed M. M. Tahaghoghi
Abstract

Stemming words, usually by removing suffixes, has applications in text search, machine translation, document summarisation, and text classification. For example, English stemming reduces the words “computer”, “computing”, “computation”, and “computability” to their common morphological root, “comput-”. In text search, this permits a search for “computers” to find documents containing any word with the stem “comput-”. In the Indonesian language, stemming is of crucial importance: words have prefixes, suffixes, infixes, and confixes that make matching related words difficult. In this paper, we investigate the performance of five Indonesian stemming algorithms through a user study. Our results show that, with the availability of a reasonable dictionary, the unpublished algorithm of Nazief and Adriani stems around 93% of word occurrences to the correct root word. With the improvements we propose, this almost reaches 95%. We conclude that stemming for Indonesian should be performed using our modified Nazief and Adriani approach.
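To make the idea of dictionary-backed affix stripping concrete, the following is a minimal sketch in Python. It is not the Nazief and Adriani algorithm, nor the modified version the abstract refers to; it only illustrates why a root-word dictionary helps when a word may carry both a prefix and a suffix. The tiny dictionary, the affix lists, and the function name `stem` are all assumptions made for illustration.

```python
# Toy dictionary-backed affix stripper for Indonesian.
# NOT the Nazief and Adriani algorithm: the root dictionary and the affix
# lists below are small, hypothetical samples chosen for illustration.

ROOTS = {"ajar", "baca", "main", "makan"}  # hypothetical root-word dictionary

SUFFIXES = ("kan", "an", "i")  # a few derivational suffixes
PREFIXES = ("mem", "me", "ber", "di", "ter", "pe")  # ignores most meN- variants


def stem(word, roots=ROOTS):
    """Return a dictionary root for `word`, or `word` unchanged if none is found."""
    word = word.lower()
    if word in roots:
        return word

    # Candidate forms: the word itself, plus the word with one suffix removed.
    candidates = [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidates.append(word[: -len(suffix)])

    for candidate in candidates:
        # Accept a candidate only if the dictionary confirms it is a root.
        if candidate in roots:
            return candidate
        # Otherwise try removing one prefix and check the dictionary again.
        for prefix in PREFIXES:
            if candidate.startswith(prefix) and candidate[len(prefix):] in roots:
                return candidate[len(prefix):]

    # No dictionary root found: leave the word untouched rather than guess.
    return word


if __name__ == "__main__":
    for w in ["membaca", "bacaan", "membacakan", "dimakan", "makanan", "bermain"]:
        print(w, "->", stem(w))
```

The dictionary check is what stops the sketch from over-stemming: "makanan" loses only its final "-an" because "makan" is a known root, and a word with no matching root is returned unchanged. A real stemmer would also need to handle infixes, spelling changes at prefix boundaries, and rule ordering, none of which this sketch attempts.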

Similar resources

Indexing the Indonesian Web: Language Identification and Miscellaneous Issues

Information retrieval tools and search engines have mainly been leveraging research results and technologies developed for the English language. In this paper we report the issues and obstacles we met in the process of designing and developing a search engine for the Indonesian language, as well as our progress and results. The results include original contributions such as a grammar for stemmi...

Intellectual Pilgrimages and Local Norms in Fashioning Indonesian Islam

Muslims living in the Indonesian archipelago have long placed considerable importance on their travels to and communications with what they saw as intellectual centers for the study of Islam. I trace some of the effects of these “intellectual pilgrimages” to Mecca, Cairo, and elsewhere on Indonesian deliberations about Islam, particularly concerning Islamic law. I argue that these references to...

Modified Grapheme Encoding and Phonemic Rule to Improve PNNR-Based Indonesian G2P

A grapheme-to-phoneme conversion (G2P) is very important in both speech recognition and synthesis. The existing Indonesian G2P based on pseudo nearest neighbour rule (PNNR) has two drawbacks: the grapheme encoding does not adapt all Indonesian phonemic rules and the PNNR should select a best phoneme from all possible conversions even though they can be filtered by some phonemic rules. In this p...

Lemmatization Technique in Bahasa: Indonesian Language

Much research and many inventions have been made in the field of linguistics and technology. Even so, the integration between linguistics and technology is not always reliable for every language. Every language is unique in its linguistic nature and rules. In this paper, a lemmatization technique in Bahasa (Indonesian language) is presented. It has achieved good precision by using The Indonesian Dict...

Automatic Learning of Stemming Rules for the Indonesian Language

We present a method for the automatic learning of stemming rules for the Indonesian language. The learning process uses an unlabelled corpus. In the first phase the candidate (word, stem) pairs are automatically extracted from a set of online documents. This phase uses a dictionary but is nevertheless not trivial because of morphing. In the second phase the rules are induced from the thus obtai...

Publication date: 2005